In the computational-mechanics structural analysis of one-dimensional cellular automata the
following automata-theoretic analogue of the change-point problem from time series analysis
arises: Given a string $\sigma$ and a collection $\{\mathcal{D}_i\}$ of finite automata, identify the regions of $\sigma$ that
belong to each $\mathcal{D}_i$ and, in particular, the boundaries separating them. We present two methods
for solving this multi-regular language filtering problem. The first, although providing the ideal
solution, requires a stack, has a worst-case compute time that grows quadratically in $\sigma$'s length,
and conditions its output at any point on arbitrarily long windows of future input. The second
method is to algorithmically construct a transducer that approximates the first algorithm. In
contrast to the stack-based algorithm, however, the transducer requires only a finite amount of
memory, runs in linear time, and gives immediate output for each letter read; it is, moreover, the
best possible finite-state approximation with these three features.

I. INTRODUCTION

Imagine you are confronted with an immense one-dimensional dataset in the form of a
string $\sigma$ of letters from a finite alphabet $\Sigma$. Suppose moreover that you discover that
vast expanses of $\sigma$ are regular in the sense that they are recognized by simple finite
automata $\mathcal{D}_1, \ldots, \mathcal{D}_n$. You might wish to bleach out these regular substrings so that
only the boundaries separating them remain, for this reduced presentation might
illuminate $\sigma$'s more subtle, larger-scale structure.

This multi-regular language filtering problem is the automata-theoretic analogue of
several, more statistical, problems that arise in a wide range of disciplines. Examples
include estimating stationary epochs within time series (known as the change-point
problem [1]), distinguishing gene sequences and promoter regions from enveloping junk
DNA [2], detecting phonemes in sampled speech [3], and identifying regular segments
within line-drawings [4], to mention a few.

The multi-regular language filtering problem arises directly in the
computational-mechanics structural analysis of cellular automata [5]. There, finite
automata recognizing temporally invariant sets of strings are identified and then
filtered from space-time diagrams to reveal systems of particles whose interactions
capture the essence of how a cellular automaton processes spatially distributed
information.

We present two methods for solving the multi-regular language filtering problem. The
first covers $\sigma$ with maximal substrings recognized by the automata $\{\mathcal{D}_i\}$. The
interesting parts of $\sigma$ are then located where these segments overlap or abut.
Although this approach provides the ideal solution to the problem, it unfortunately
requires an arbitrarily deep stack to compute, has a worst-case compute time that
grows quadratically in $\sigma$'s length, and conditions its output at any point on
arbitrarily long windows of future input. As a result, this method becomes extremely
expensive to compute for large data sets, including the expansive space-time diagrams
that researchers of cellular automata often scrutinize.

The second method, and our primary focus, is to algorithmically construct a finite
transducer that approximates the first, stack-based algorithm by printing sequences of
labels $i$ over segments of $\sigma$ recognized by the automaton $\mathcal{D}_i$. When, at the end of
such a segment, the transducer encounters a letter forbidden by the prevailing
automaton $\mathcal{D}_i$, it prints special symbols until it resynchronizes to a new automaton
$\mathcal{D}_j$. In this way, the transducer approximates the stack-based algorithm by jumping
from one maximal substring to the next, printing a few special symbols in between.
Since it does not jump to a new maximal substring until the preceding one ends,
however, the transducer can miss the true beginning of any maximal substring that
overlaps with the preceding one. Typically, the benefits of the finite transducer
outweigh the occurrence of such errors.

In contrast with the stack-based algorithm it approximates, the transducer requires
only a finite amount of memory, runs in linear time, and gives immediate output for
each letter read. These are significant improvements for cellular automata structural
analysis and, we suspect, for other applications as well. Put more precisely, the
transducer is Lipschitz-continuous (with Lipschitz constant one) under the cylinder-set
topology, whereas the stack-based algorithm, which conditions its output on
arbitrarily long windows of future input, is generally not even continuous.

It is also worth noting that the transducers thus 
produced are the best possible approximations with 
these three features and are identical to those that re- 
searchers have historically constructed by hand. Our 
algorithm thus relieves researchers of the tedium of 
constructing ever more complicated transducers. 

Cellular Automata 

Before presenting our two filtering methods, we in- 
troduce cellular automata in order to highlight an im- 
portant setting where the multi-regular language fil- 
tering problem arises, as well as to give some visual 
intuition to our approach. 

Let $\Sigma$ be a discrete alphabet of $k$ symbols. A local update rule of radius $r$ is any
function $\phi : \Sigma^{2r+1} \to \Sigma$. Given such a function, we can construct a global mapping
of bi-infinite strings $\Phi : \Sigma^{\mathbb{Z}} \to \Sigma^{\mathbb{Z}}$, called a one-dimensional cellular automaton
(CA), by setting:

$$\Phi(\sigma)_i := \phi(\sigma_{i-r} \cdots \sigma_i \cdots \sigma_{i+r}),$$

where $\sigma_i$ denotes the $i$th letter of the string $\sigma$. Since the image under $\Phi$ of any
period-$N$ bi-infinite string also has period $N$, it is common to regard $\Phi$ as a
mapping of finite strings, $\Sigma^N \to \Sigma^N$. When regarded in this way, a CA is said to have
periodic boundary conditions.

For $k = 2$ and $r = 1$, there are precisely 256 local update rules, and the resulting
CAs are called the elementary CAs (or ECAs). Wolfram [6] introduced a numbering scheme
for them: Order the neighborhoods $\Sigma^3$ lexicographically and interpret the symbols
$\{\phi(\eta) : \eta \in \Sigma^3\}$ as the binary representation of an integer between 0 and 255.
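The numbering scheme and the global map with periodic boundary conditions can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `local_rule` and `step` are our own, and strings are represented as lists of integers.

```python
def local_rule(number, k=2, r=1):
    """Build the local update rule phi for a given rule number.

    The neighborhood whose base-k value is j is mapped to the j-th base-k
    digit of `number` (least significant digit first). For k=2, r=1 this is
    Wolfram's numbering of the elementary CAs.
    """
    size = 2 * r + 1
    table = {}
    for j in range(k ** size):
        # Recover the neighborhood from its base-k value j.
        digits = []
        q = j
        for _ in range(size):
            digits.append(q % k)
            q //= k
        neighborhood = tuple(reversed(digits))
        table[neighborhood] = (number // k ** j) % k
    return table


def step(sigma, phi, r=1):
    """Apply the global map Phi once, with periodic boundary conditions."""
    n = len(sigma)
    return [phi[tuple(sigma[(i + d) % n] for d in range(-r, r + 1))]
            for i in range(n)]


phi_110 = local_rule(110)
row = [0, 0, 0, 1, 0]
next_row = step(row, phi_110)  # one iterate of ECA 110 on a length-5 ring
```

Iterating `step` and stacking the rows gives the kind of space-time diagram discussed next.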

By interpreting a string's letters as values assumed by the sites of a discrete
lattice, a CA can be viewed as a spatially extended dynamical system that is discrete
in time, space, and local state. Its behavior as such is often illustrated through
so-called space-time diagrams, in which the iterates $\{\Phi^t(\sigma^0)\}_{t=0,1,2,\ldots}$ of an initial
string $\sigma^0$ are plotted as a function of time. Figure 1, for example, depicts ECA 110
acting iteratively on an initial string of length $N = 150$.

Due to their appealingly simple architecture, researchers have studied CAs not only as
abstract mathematical objects, but as models for physical, chemical, biological, and
social phenomena such as fluid flow, galaxy formation, earthquakes, chemical pattern
formation, biological morphogenesis, and vehicular traffic dynamics. Additionally,
they have been used as parallel computing devices, both for the high-speed simulation
of scientific models and for computational tasks such as image processing. More
generally, CAs have provided a simplified setting for studying the "emergence" of
cooperative or collective behavior in complex systems. The literature for all these
applications is vast and includes Refs. [7, 8, 9, 10, 11, 12, 13, 14, 15, 16].

FIG. 1: A space-time diagram illustrating the typical behavior of ECA 110 (sites 1 to
150 along the horizontal axis). Black squares correspond to 1s, and white squares to
0s.



Computational-Mechanics Structural Analysis of CAs

The computational-mechanics [17, 18] structural analysis of a CA rests on the
discovery of a "pattern basis", a collection $\{\mathcal{D}_i\}$ of automata that describe the
emergent structural components in the CA's space-time behavior [19, 20]. Once such a
pattern basis is found, conforming regions of space-time can be seen as background
domains through which coherent structures not fitting the basis move. In this way,
structural features set against the domains can be identified and analyzed.

More formally, Crutchfield and Hanson define a regular domain $\mathcal{D}$ to be a regular
language (the collection of strings recognized by some finite automaton) that is:

1. temporally invariant: the CA maps $\mathcal{D}$ onto itself; that is, $\Phi^n[\mathcal{D}] = \mathcal{D}$ for
some $n > 0$; and

2. spatially homogeneous: the same pattern can occur at any letter; that is, the
recurrent states in the minimal finite automaton recognizing $\mathcal{D}$ are strongly
connected.

FIG. 2: (Left) Space-time diagram illustrating the typical behavior of ECA 18, a CA
exhibiting apparently random behavior, i.e., the set of length-$\ell$ spatial strings has a
positive entropy density as $\ell \to \infty$. (Right) The same space-time diagram filtered with
the regular domain $\mathcal{D} = \mathrm{sub}([0(0+1)]^*)$. (After Ref. [21].)

Once we discover a CA's regular domains, either through visual inspection or by an
automated induction method such as the $\epsilon$-machine reconstruction algorithm [5], the
corresponding space-time regions are, in a sense, understood. Given this level of
discovered regularity, we bleach out the domain-conforming regions from space-time
diagrams, leaving only "un-modeled" deviations, whose dynamics can then be studied.
Sometimes, as is the case for the CAs we exhibit here, these deviations resemble
particles. By studying the characteristics of these particle-like deviations (how they
move and what happens when they collide), we hope to understand the CA's (possibly
hidden) computational capabilities.

Consider, for example, the apparently random behavior of ECA 18, illustrated in
Fig. 2. Although no coherent structures present themselves to the eye,
computational-mechanics structural analysis lays bare particles hidden within its
output: Filtering its space-time diagrams with the regular domain
$\mathcal{D} = \mathrm{sub}([0(0+1)]^*)$, where $\mathrm{sub}(\mathcal{L})$ denotes the regular language consisting of all
subwords of strings belonging to the regular language $\mathcal{L}$, reveals a system of
particles that follow random walks and pairwise annihilate whenever they touch
[20, 21, 22]. Thus, by blurring the CA's deterministic behavior on strings, we
discover higher-level stochastic particle dynamics. Although this loss of
deterministic detail may at first seem conceptually unsatisfying, the resulting view
is more structurally detailed than the vague classification of ECA 18 as "chaotic".
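To make the ECA 18 domain concrete, the following Python sketch tests membership in $\mathrm{sub}([0(0+1)]^*)$. It relies on an observation of ours rather than anything stated in the text: a string is a subword of some string in $[0(0+1)]^*$ exactly when all of its even-indexed letters, or all of its odd-indexed letters, are 0. The function name `in_domain` is hypothetical.

```python
def in_domain(w):
    """Return True if w is a subword of some string in (0(0+1))*.

    Strings in (0(0+1))* carry a 0 at every other position, so a subword
    must have 0s on one of its two parity classes of positions.
    """
    even_ok = all(c == "0" for c in w[0::2])
    odd_ok = all(c == "0" for c in w[1::2])
    return even_ok or odd_ok


# "1010" fits the domain (0s on odd positions); "0110" does not, because
# the adjacent pair "11" can never occur inside the domain.
examples = {w: in_domain(w) for w in ["0100010", "1010", "0110"]}
```

Positions where no long domain-conforming substring can be continued are the deviations that the filters described below are designed to expose.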

Thus, discovering domains and filtering them from space-time diagrams is essential to
understanding the information processing embedded within a CA's output.

II. METHOD 1— FILTERING WITH A STACK 

We now present the first method for solving the general multi-regular language
filtering problem with which we began. Although the following method is perhaps the
most thorough and easiest to describe, it requires an arbitrarily deep stack to
compute. Its description will rest upon a few basic ideas from automata theory.
(Please refer to the first few paragraphs of App. A, up to and including Lemma 2,
where these preliminaries are reviewed.)

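To filter a string $\sigma$, this method identifies the collection of its maximal
substrings that the automata $\{\mathcal{D}_i\}$ accept. More formally, given a string $\sigma$, let
$\sigma_{a,b}$ denote the substring $\sigma_a \sigma_{a+1} \cdots \sigma_b$ for $a, b \in \mathbb{Z}$. If $\sigma$ is bi-infinite, extend
this notation so that $a = -\infty$ and $b = \infty$ denote the intuitive infinite substrings.
Place a partial ordering $\prec$ on all such substrings by setting $\sigma_{a,b} \prec \sigma_{a',b'}$ if
$a' \le a \le b \le b'$. Then let $\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma)$ denote the collection of maximal
substrings $\sigma_{a,b}$ (with respect to $\prec$) that the $\{\mathcal{D}_i\}$ accept or, in symbols, let:

$$\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma) := \{\sigma_{a,b} \in \mathcal{P} \mid \text{there is no } \sigma_{a',b'} \in \mathcal{P},\ \sigma_{a',b'} \ne \sigma_{a,b}, \text{ with } \sigma_{a,b} \prec \sigma_{a',b'}\},$$

where $\mathcal{P} := \{\sigma_{a,b} : \mathcal{D}_i \text{ accepts } \sigma_{a,b} \text{ for some } i\}$.

The following algorithm can be used to compute $\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma)$.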

Algorithm 1. Input: The automata $\mathcal{D}_1, \ldots, \mathcal{D}_n$ and the length-$N$ string $\sigma$.

Let $A := \mathrm{Det}(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$.
Let $s_0$ be $A$'s unique start state.
Let $S$ and $M$ be empty stacks.
For $j = 1 \ldots N$ do
    Push $(s_0, j)$ onto $S$.
    For each $(s, i) \in S$ do
        If there is a transition $(s, \sigma_j, s') \in T(A)$
            then replace $(s, i)$ with $(s', i)$ in $S$.
        Otherwise, remove $(s, i)$ from $S$.
            If, in addition, $(s, i)$ was at the bottom of $S$
            then push the pair $(i, j - 1)$ onto $M$.
Let $(s_f, i_f)$ be the pair at the bottom of $S$.
Push $(i_f, N)$ onto $M$.
Output: $M$.
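The following Python sketch is one possible reading of Algorithm 1, not a definitive implementation. It makes several assumptions that go beyond the text: the deterministic automaton $A$ is passed as a dictionary `delta` from (state, letter) pairs to states; every state of $A$ is treated as accepting (as is the case for substring-closed domain languages); "at the bottom of $S$" is interpreted as the oldest pair present when the letter is read; and empty substrings are not reported. The names `maximal_substrings`, `delta`, and `start` are ours.

```python
from collections import deque


def maximal_substrings(delta, start, sigma):
    """Return 1-based, inclusive index pairs (a, b) of maximal substrings.

    delta: dict mapping (state, letter) -> next state of the deterministic
           automaton A; a missing key is a forbidden transition.
    start: A's unique start state.
    """
    S = deque()  # live candidates, oldest (the "bottom") first
    M = []
    for j, letter in enumerate(sigma, start=1):
        had_bottom = bool(S)  # was some candidate alive before this letter?
        S.append((start, j))
        advanced = deque()
        for position, (s, i) in enumerate(S):
            nxt = delta.get((s, letter))
            if nxt is not None:
                advanced.append((nxt, i))
            elif position == 0 and had_bottom:
                # The oldest candidate just died: sigma[i..j-1] is maximal.
                M.append((i, j - 1))
        S = advanced
    if S:
        M.append((S[0][1], len(sigma)))
    return M


# Hypothetical single domain: strings over {0, 1} containing at most one 1.
delta = {("u", "0"): "u", ("u", "1"): "v", ("v", "0"): "v"}
pairs = maximal_substrings(delta, "u", "01010")
```

On the string `01010` the two maximal substrings overlap at the third letter, which is exactly the situation that a left-to-right finite transducer cannot report in full.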

The following proposition is easily verified, and we state it without proof.

Proposition 1. If $\sigma$ is a finite string and if $M_\sigma$ is the output of the above
algorithm when applied to $\sigma$, then

$$\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma) = \{\sigma_{a,b} : (a, b) \in M_\sigma\}.$$

We summarize Prop. 1 by saying that Algorithm 1 solves the local filtering problem in
the sense that it can compute $\mathcal{P}_{\max}(\{\mathcal{D}_i\}, w)$ over a finite, contractible window $w$.
(By contractible we mean that periodic boundary conditions along the boundary of $w$
are ignored.)

The global filtering problem, which takes into account periodic boundary conditions,
is considerably more subtle. A somewhat pedantic example is filtering the bi-infinite
string $0^{\mathbb{Z}}$ consisting entirely of 0s with the language $\mathrm{sub}[(0^m 1)^*]$. (Recall that
$\mathrm{sub}(\mathcal{L})$ is our notation for the collection of substrings of strings belonging to
$\mathcal{L}$.) The local approach applied to a finite length-$N$ window $0^N$, where $N < m$, will
return $0^N$ itself as its single maximal substring; i.e.,
$\mathcal{P}_{\max}(\{\mathrm{sub}[(0^m 1)^*]\}, 0^N) = \{0^N\}$. In contrast, the global filter of $0^{\mathbb{Z}}$ will
consist of heavily overlapping length-$m$ substrings beginning and ending at every
position within $0^{\mathbb{Z}}$:

$$\mathcal{P}_{\max}(\{\mathrm{sub}[(0^m 1)^*]\}, 0^{\mathbb{Z}}) = \{0_{a+1} \cdots 0_{a+m} : a \in \mathbb{Z}\}.$$

Fortunately, by examining sufficiently large finite windows, Algorithm 1 can also be
used to solve this more subtle global filtering problem in the case of a bi-infinite
string that is periodic. The following Lemma captures the essential observation.

Lemma 1. Suppose $\sigma$ is a period-$N$ bi-infinite string. Then every maximal substring
$\sigma_{a,b} \in \mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma)$ must have length $\le m \cdot N$, where $m := \max_i |S(\mathcal{D}_i)|$, or else
$\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma)$ must consist of $\sigma_{-\infty,\infty} = \sigma$ alone.

Proof. Our argument is a variation on the proof of the classical Pumping Lemma from
automata theory. Suppose that $\sigma_{a,b} \in \mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma)$, $a$ and $b$ are finite, and
$b - a + 1 > m \cdot N$. Then one of the domains, say $\mathcal{D}_i$, accepts $\sigma_{a,b}$. By definition, this
means there is a sequence of transitions in $T(\mathcal{D}_i)$ of the form
$(s_a, \sigma_a, s_{a+1}), (s_{a+1}, \sigma_{a+1}, s_{a+2}), \ldots, (s_b, \sigma_b, s_{b+1})$. Consider the sequence of pairs:

$$\{(s_l, l \bmod N)\}_{l=a}^{b} \subset S(\mathcal{D}_i) \times \mathbb{Z}_N.$$

Since:

$$b - a + 1 > m \cdot N \ge |S(\mathcal{D}_i) \times \mathbb{Z}_N|,$$

the Pigeonhole Principle implies that this sequence must repeat, say
$(s_l, l \bmod N) = (s_{l'}, l' \bmod N)$ for integers $l < l'$. But then $\mathcal{D}_i$ must also accept
any string of the form:

$$\sigma_{a,l-1}\,(\sigma_{l,l'-1})^{n}\,\sigma_{l',b}, \qquad n \ge 1.$$

Since $l \bmod N = l' \bmod N$, such strings correspond to arbitrarily long substrings of
the original bi-infinite string $\sigma$. As a result, $\sigma_{a,b}$ cannot be maximal. This
contradiction implies that either (i) $a$ and $b$ are not both finite or (ii)
$b - a + 1 \le m \cdot N$. A straightforward generalization of our argument in fact shows that
either (i) both $a$ and $b$ are infinite or (ii) $b - a + 1 \le m \cdot N$. □

A consequence of Lemma 1 is that we can solve the global filtering problem by applying
Algorithm 1 to a window of length $mN + 1$.

Proposition 2. Suppose $\sigma$ is a period-$N$ bi-infinite string and that $M_{\sigma'}$ is the
output of Algorithm 1 when applied to the finite string $\sigma' := \sigma_1 \sigma_2 \cdots \sigma_{mN+1}$, where
$m := \max_i |S(\mathcal{D}_i)|$. Then:

$$\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma) = \{\sigma_{a+qN,\,b+qN} : (a, b) \in M_{\sigma'},\ q \in \mathbb{Z}\},$$

unless $M_{\sigma'}$ consists of $(1, mN+1)$ alone, in which case
$\mathcal{P}_{\max}(\{\mathcal{D}_i\}, \sigma) = \{\sigma_{-\infty,\infty} = \sigma\}$.

The major drawback of Algorithm 1, however, is its worst-case compute time.

Proposition 3. The worst-case performance of the stack-based filtering algorithm
(Algorithm 1) has order $O(N^2)$, where $N$ is the length of the input string $\sigma$.

Proof. For each $j = 1 \ldots N$, the algorithm pushes a new pair $(s_0, j)$ onto the stack $S$
and then advances each pair on $S$. In the case that $A$ accepts the entire string $\sigma$,
the algorithm will never remove any pairs from $S$ and will thus advance a total of
$\sum_{j=1}^{N} j = \frac{1}{2} N(N+1)$ pairs. The proposition follows since it is possible to advance
each pair in constant time. □

III. METHOD 2— FILTERING WITH A 
TRANSDUCER 

The second method, and our primary focus, is to algorithmically construct a finite
transducer that approximates the stack-based Algorithm 1 by printing sequences of
labels $i$ over segments of $\sigma$ recognized by the automaton $\mathcal{D}_i$. When, at the end of
such a segment, the transducer encounters a letter forbidden by the prevailing
automaton $\mathcal{D}_i$, it prints special symbols until it resynchronizes to a new automaton
$\mathcal{D}_j$. The special symbols consist of labels for the kinds of domain-to-domain
transition and $\lambda$, which indicates that classification is ambiguous.

In this way, the transducer approximates the stack- 
based algorithm by jumping from one maximal sub- 
string to the next, printing a few special symbols in 
between. Because it does not jump to a new maximal 
substring until the preceding one ends, however, the 
transducer can miss the true beginning of any max- 
imal substring that overlaps with the preceding one. 
But if no more than two maximal substrings overlap at any given point of $\sigma$, then it
is possible to combine the output of two transducers, one reading left-to-right and
the other reading right-to-left, to obtain the same output as the stack-based
algorithm.

These shortcomings are minor, and in exchange 
the transducer gains several significant advantages 
over the stack-based algorithm it approximates: It re- 
quires only a finite amount of memory, runs in linear 
time, and gives immediate output for each letter read. 

Although finite transducers are generally consid- 
ered less sophisticated than stack-based algorithms 
in the sense of computational complexity, the con- 
struction of this transducer is considerably more in- 
tricate than the preceding stack-based algorithm and 
is, in fact, our principal aim in the following. 

Our approach will be to construct a transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$ by 'filling in' the
forbidden transitions of the automaton $A := \mathrm{Det}(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$. We will thus tie our
hands behind our backs at the outset by permitting the transducer to remember only as
much about past input as does the automaton $A$ while recognizing domain strings.

Unfortunately, $A$'s states will generally preserve too little information to
facilitate optimal resynchronization. It is possible, however, to begin with
elaborately constructed, equivalent, non-minimal domains that yield an automaton
$A' := \mathrm{Det}(\mathcal{D}'_1 \sqcup \cdots \sqcup \mathcal{D}'_n)$ whose states do preserve just enough information to
facilitate optimal resynchronization. The transducer obtained by 'filling in' the
forbidden transitions of this automaton $A'$ represents the best possible (transducer)
approximation of the stack-based algorithm. We present a preprocessing algorithm which
produces these equivalent, non-minimal domains $\{\mathcal{D}'_i\} = \mathrm{Optimize}(\{\mathcal{D}_i\})$ at the end of
our discussion of Method-2 filtering.

The idea underlying our construction is the following. Suppose that while reading the
string $\sigma$ we are recognizing an increasingly long string accepted by $\mathcal{D}_i$ when we
encounter a forbidden letter $a$. In accepting $\sigma$ up to this point, the automaton $A$
will have reached a certain state $s \in S(A)$ that has no outgoing transition
corresponding to the letter $a$. Our goal is to create such a transition by examining
the collection of all possible strings that could have placed us in the state $s$ and
to resynchronize to the state of $A$ that is most compatible with the potentially
foreign strings obtained by appending to these strings the forbidden letter $a$.

In this situation there will be two natural desires. On the one hand, we wish to
unambiguously resynchronize to as specific a domain state as possible; but, on the
other, we wish to rely on as little of the imagined past as possible. (We use the term
imagined because our transducer remembers only the state $s \in S(A)$ we have reached,
not the particular string that placed us there.) To reflect these desires, we
introduce a partial ordering on the collection of potential resynchronization states
$\{S_{i,l}\}$, where $i$ measures the specificity of resynchronization and $l$ the length of
imagined past.

We now implement this intuition in full detail. Our exposition relies heavily on ideas
from automata theory. (We now urge reading App. A in its entirety.)

As above, let $A := \mathrm{Det}(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$ and let $S(A) \hookrightarrow 2^{S(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)}$ be the
canonical injection provided by Lemma 2 in App. A. Assume that there is a canonical
injection $S(\mathcal{D}_1) \sqcup \cdots \sqcup S(\mathcal{D}_n) \hookrightarrow S(A)$ and that we can therefore regard the sets $S(\mathcal{D}_i)$
as subsets of $S(A)$. An example of this situation is depicted in Fig. 3. A sufficient
condition for the existence of such an injection is that each $\mathcal{D}_i$ is minimal and
that $\mathrm{Lang}(\mathcal{D}_i) \not\subset \mathrm{Lang}(\mathcal{D}_l)$ for $i \ne l$. Minimality is far from required, however, and
the assumption is valid for a much larger class of domains. (Put informally, it
suffices if we can associate to each state $s \in S(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$ a string that
corresponds to a unique path through $\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n$, one that leads to $s$.)

Let $T$ be a transducer with the same states, start state, and final states as $A$, but
with the transitions:

$$T(T) := \{(s, a|f(s'), s') : (s, a, s') \in T(A)\},$$

where:

$$f(s') := \begin{cases} i & \text{if } \phi_A(s') \subset S(\mathcal{D}_i), \\ \lambda & \text{otherwise}, \end{cases}$$

and where $\lambda$ is a new symbol in the output alphabet $\Sigma'$ indicating that domain
labeling was not possible, for example, because the partial string read so far belongs
to more than one or none of the automata $\{\mathcal{D}_i\}$. To recapitulate, the transducer's
output alphabet $\Sigma'$ consists of three kinds of symbol: domain labels $\{1 \ldots n\}$,
domain-domain transition types $\{1, 2, \ldots, p\}$, and ambiguity $\lambda$.
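As a rough illustration of this labeling step, the Python sketch below builds the deterministic automaton of a disjoint union by the standard subset construction and attaches the label $f(s')$ to each permitted transition. The data layout is an assumption of ours: each domain is a pair `(start_states, delta)` with `delta` mapping (state, letter) to a set of states, and states of domain $i$ are tagged `(i, state)` to keep the union disjoint. The function name `labeled_transducer` is hypothetical, and forbidden transitions are simply left out, to be filled in by the resynchronization construction.

```python
def labeled_transducer(domains, alphabet):
    """Subset construction on the disjoint union, with domain labels.

    Returns (start, transitions), where transitions maps
    (subset_state, letter) -> (label, next_subset_state).
    The label is i when the next subset lies inside domain i alone,
    and the ambiguity symbol "λ" otherwise.
    """
    start = frozenset((i, q)
                      for i, (starts, _) in enumerate(domains, 1)
                      for q in starts)
    transitions = {}
    todo = [start]
    seen = {start}
    while todo:
        S = todo.pop()
        for a in alphabet:
            T = frozenset((i, q2)
                          for (i, q) in S
                          for q2 in domains[i - 1][1].get((q, a), ()))
            if not T:
                continue  # forbidden transition: no domain can read `a` here
            owners = {i for (i, _) in T}
            label = owners.pop() if len(owners) == 1 else "λ"
            transitions[(S, a)] = (label, T)
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return start, transitions


# Hypothetical domains: D1 recognizes 0*, D2 recognizes 1*.
d1 = ({"p"}, {("p", "0"): {"p"}})
d2 = ({"q"}, {("q", "1"): {"q"}})
start, trans = labeled_transducer([d1, d2], "01")
```

In this toy case every permitted transition is labeled 1 or 2; the pairs that are missing from `trans`, such as reading a 1 from the state reached on 0s, are the forbidden transitions that the construction described next must supply.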

The transducer $T$'s input, $\mathrm{In}(T)$, recognizes precisely those strings recognized by
the given domains. Our goal is to extend $T$ by introducing transitions of the form:

$$\{(s, a|h(s,a), g(s,a)) : s \in S(T) = S(A),\ a \in \Sigma, \text{ and there are no transitions of the form } (s, a, \cdot) \in T(A)\},$$

where the functions $g(s,a)$ and $h(s,a)$ are defined in the following paragraphs. The
transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$ obtained by adding these transitions to $T$ will then have
the desired property that its input $\mathrm{In}(\mathrm{Filter}(\{\mathcal{D}_i\}))$ will accept all strings [29].

FIG. 3: The domains $\mathcal{D}_1$ and $\mathcal{D}_2$ (top) and the automaton $A = \mathrm{Det}(\mathcal{D}_1 \sqcup \mathcal{D}_2)$
(bottom). Start states are indicated by dotted arrows from the word "Start", and final
states are darkened. Notice that the states of $A$ correspond to collections of states
of $\mathcal{D}_1$ and $\mathcal{D}_2$ and that the states of the domains are canonically injected into
those of $A$, here by the map $n \mapsto [n]$.

Let $W_l^{s}$ denote the collection of strings corresponding to length-$l$ paths through $A$
beginning in any of its states, but ending in state $s$, and let $W_l^{s,a}$ denote the
collection of strings obtained by appending the letter $a$ to the strings of $W_l^{s}$. The
strings $\bigcup_{l \ge 0} W_l^{s,a}$ are accepted by the finite automaton $A^{s,a}$ obtained by adding a
new state $f$ and a transition $(s, a, f)$ to $A$, and by setting
$\mathrm{Start}(A^{s,a}) := S(A^{s,a})$ and $\mathrm{Final}(A^{s,a}) := \{f\}$. An example is shown in Fig. 4, where
the four-state domain has a transition added from state [2] on symbol 1, which was
originally forbidden.





FIG. 4: The semi-deterministic automaton $A^{s,a}$ (top) obtained by adding a state
$f = [9]$ and its deterministic version $\mathrm{Det}(A^{s,a})$ (bottom) with states relabeled with
the integers 1 ... 17 in order to simplify later diagrams.

In order to choose the resynchronization state $g(s,a)$ for the forbidden transition
$(s,a)$, we examine the strings of $\bigcup_{l \ge 0} W_l^{s,a}$ that also belong to one or more of the
domains $\{\mathcal{D}_i\}$. We do this by constructing the automaton $\mathrm{Det}(A^{s,a}) \cap A$, which we call
the resynchronization automaton. By Lemma 3, there is a canonical, although not
necessarily injective, association:

$$\phi : S(\mathrm{Det}(A^{s,a}) \cap A) \to S(A)$$

given by the composition:

$$S(\mathrm{Det}(A^{s,a}) \cap A) \hookrightarrow S(\mathrm{Det}(A^{s,a})) \times S(A) \to S(A),$$

where the right-most map is the second-factor projection, $(s, s') \mapsto s'$.

The resynchronization automaton $\mathrm{Det}(A^{s,a}) \cap A$ may reveal several possible
resynchronization states. To help distinguish among them, we put them into sets
$\{S_{i,l}\}$ where $i$ measures the specificity of resynchronization and $l$ the length of
imagined past. More precisely, let $S_{i,l}$ denote those states $s \in S(A)$ to which $\phi$
associates at least one state $s' \in \mathrm{Final}(\mathrm{Det}(A^{s,a}) \cap A)$ (i.e., $s = \phi(s')$) satisfying
the following two conditions: (1) $s$ corresponds, under Lemma 2, to precisely $i$ states
of $\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n$ and (2) there is a length-$l$ path from the unique start state of
$\mathrm{Det}(A^{s,a}) \cap A$ to $s'$.

Give the sets $\{S_{i,l}\}$ the dictionary ordering; that is, let $S_{i,l} < S_{i',l'}$ if $i < i'$
or if $i = i'$ and $l < l'$. The set $S_{|S(A)|,0}$ consists of the unique start state of
$\mathrm{Det}(A^{s,a}) \cap A$. Thus, by the well ordering principle, there must be a unique, least
set among the sets $\{S_{i,l}\}$ that consist of a single state, say $\{s'\}$. Let
$g(s,a) := s'$, and let $h(s,a) := h'(s,s') = h'(s, g(s,a))$, where $h'$ is any injection
$S(T) \times S(T) \to \Sigma'$ (chosen independent of $s$ and $a$). An example of this construction
is shown in Fig. 5.

The transducer is completed by repeating the above 
steps for all forbidden transitions. 



Computability of the transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$

Although the transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$ is well defined, it is perhaps not
immediately clear that it is computable. After all, we appealed to the well ordering
principle to obtain a least singleton set $\{s'\}$ among the sets $\{S_{i,l}\}$. In fact,
infinitely many sets $S_{i,l}$ precede the stated upper bound $S_{|S(A)|,0}$; for instance, all
of the sets $S_{1,l}$ do, provided $|S(A)| > 1$.

The construction is nevertheless computable, because for each $i$ the sequence of sets
$S_{i,l}$ must eventually repeat. In fact, we can compute this sequence of sets exactly by
automata-theoretic means.

Proposition 4. The transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$ is computable.

Proof. Let $Z[C]$ denote the automaton obtained by relabeling all of the automaton $C$'s
transitions with 0s. This automaton will almost certainly be nondeterministic. The
equivalent deterministic automaton $\mathrm{Det}(Z[C])$ is useful, because the state it reaches
when accepting the string $0^l$ corresponds precisely, under Lemma 2, to the collection
of states that can be reached by length-$l$ paths through $C$.

FIG. 5: The resynchronization automaton $\mathrm{Det}(A^{s,a}) \cap A$ (top). Here $S_{1,3}$ consists of
the state (13, [6]) alone, and all other $S_{1,l}$ are empty. So we choose $s' = [6]$ and add
a transition $([2], 1|h'([2],[6]), [6])$ to $T$ (bottom).

Moreover, since $\mathrm{Det}(Z[C])$ is defined over a single letter, yet deterministic and
finite, it must have a special graphical structure: its single start state $s_0$ must
lead to a finite loop after a finite chain of non-recurrent states. (Actually, if $C$
has no loops whatsoever, there will not even be a loop.) Thus, its states have a
linear ordering:

$$s_0 \to s_1 \to \cdots \to s_m \to s_{m+1} \to \cdots \to s_{m+m'} \to s_m.$$

An example is illustrated in Fig. 6, where $m = 4$ and $m' = 0$.

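By Lemma 2 the states $\{s_k\}$ correspond to collections of states of $C$ under an
injection:

$$\psi_{Z[C]} : S(\mathrm{Det}(Z[C])) \hookrightarrow \{S \subset S(Z[C])\} = \{S \subset S(C)\}.$$

Let $C := \mathrm{Det}(A^{s,a}) \cap A$ in the preceding discussion. As before, by Lemma 3, there is a
function:

$$\phi : S(\mathrm{Det}(A^{s,a}) \cap A) \to S(A).$$

Let $S_{*,l} \subset S(A)$ denote those states defined by the formula:

$$S_{*,l} := \phi[\psi_{Z[C]}(s_l) \cap \mathrm{Final}(\mathrm{Det}(A^{s,a}) \cap A)].$$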



FIG. 6: The automaton $Z[\mathrm{Det}(A^{s,a}) \cap A]$ (top) and its deterministic version
$\mathrm{Det}(Z[\mathrm{Det}(A^{s,a}) \cap A])$ (bottom).



Finally, let $S_{i,*}$ denote those states of $A$ that correspond to precisely $i$ states of
$\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n$; that is, let:

$$S_{i,*} := \{s \in S(A) : |\phi_A(s)| = i\},$$

where $\phi_A : S(A) \hookrightarrow \{S \subset S(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)\}$ is the injection provided by Lemma 2.

The sets $S_{i,l}$ can then be computed as the intersections $S_{i,*} \cap S_{*,l}$, and we need
only examine these for $1 \le i \le |S(\mathcal{D}_1)| + \cdots + |S(\mathcal{D}_n)|$ and $0 \le l \le m + m'$ to discover
the least one under the dictionary ordering that is a singleton $\{s'\}$. □
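The chain-plus-loop structure used in this proof can be computed directly, without building $\mathrm{Det}(Z[C])$ as an explicit automaton. The Python sketch below is a minimal illustration under assumptions of ours: the automaton $C$ is given only by a successor map with letters already forgotten (as in $Z[C]$), and the names `reachable_by_length`, `successors`, and `starts` are hypothetical.

```python
def reachable_by_length(successors, starts):
    """Compute the sets of states reachable by length-l paths, l = 0, 1, ...

    successors: dict mapping a state to an iterable of successor states.
    starts:     iterable of start states.

    Returns (levels, loop_start): levels[l] is the frozenset reached by
    length-l paths, listed until the sequence first repeats; loop_start is
    the index that follows the last level (the start of the loop), or None
    if the sequence dies out because there are no longer paths.
    """
    levels = []
    index = {}
    current = frozenset(starts)
    while current and current not in index:
        index[current] = len(levels)
        levels.append(current)
        current = frozenset(t for s in current for t in successors.get(s, ()))
    loop_start = index.get(current) if current else None
    return levels, loop_start


# Hypothetical three-state example: 0 -> 1 -> 2 -> 1.
levels, loop_start = reachable_by_length({0: [1], 1: [2], 2: [1]}, [0])
```

Because the sequence of levels is eventually periodic, only finitely many of the sets $S_{*,l}$ are distinct, which is what makes the search for the least singleton finite.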

We summarize the entire algorithm.

Algorithm 2. Input: The regular domains $\mathcal{D}_1, \ldots, \mathcal{D}_n$.

- Let $A := \mathrm{Det}(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$.

- Choose any injection $h' : S(A) \times S(A) \to \Sigma'$.

- Make $A$ into a transducer $T$ by adding the symbol $i$ as output to any transition
  ending in a state corresponding to states of only one domain $\mathcal{D}_i$ and by adding
  $\lambda$s as output symbols to all other transitions.

- For each forbidden transition $(s, a) \in S(A) \times \Sigma(A)$, add a transition to $T$ through
  the following procedure:

    - Construct the automaton $A^{s,a}$ by adding to $A$ the transition $(s, a, f)$, where
      $f$ is a new state, and by letting $f$ be its only final state.

    - Construct the automaton $\mathrm{Det}(Z[\mathrm{Det}(A^{s,a}) \cap A])$, where $Z[C]$ is the automaton
      obtained by relabeling all of $C$'s transitions with 0s. Its states will have a
      natural linear ordering $s_0 \to s_1 \to \cdots \to s_{m+m'}$.

    - Let $S_{*,l}$ and $S_{i,*}$ be the subsets of $S(A)$ defined by:

      $$S_{*,l} := \phi[\psi_{Z[\mathrm{Det}(A^{s,a}) \cap A]}(s_l) \cap \mathrm{Final}(\mathrm{Det}(A^{s,a}) \cap A)]$$ and
      $$S_{i,*} := \{s \in S(A) : |\phi_A(s)| = i\}.$$

    - Find the singleton set $\{s'\}$ among the sets:

      $$\{S_{i,*} \cap S_{*,l} : 1 \le i \le |\mathcal{D}_1| + \cdots + |\mathcal{D}_n|,\ 0 \le l \le m + m'\}$$

      that occurs first under the dictionary ordering.

    - Add the transition $(s, a|h'(s,s'), s')$ to $T$.

Output: $\mathrm{Filter}(\{\mathcal{D}_i\}) := T$.



Algorithmic Complexity

Proposition 5. The worst-case performance of the transducer-constructing algorithm
(Algorithm 2) has order no greater than:

$$|A| \cdot (|\Sigma| - 1) \cdot \exp \circ \exp(2 \cdot |A| + 1),$$

where $|A|$ has order $\exp(|\mathcal{D}_1| + \cdots + |\mathcal{D}_n|)$.

Proof. The algorithm's most expensive step is the computation of
$\mathrm{Det}(Z[\mathrm{Det}(A^{s,a}) \cap A])$. Unfortunately, because computing $\mathrm{Det}(Q)$ has order $\exp(|Q|)$,
and because computing $Q \cap H$ has order $|Q| \cdot |H|$, this computation has order
$\exp \circ \exp(2 \cdot |A| + 1)$.

Finally, recall that the algorithm computes $\mathrm{Det}(Z[\mathrm{Det}(A^{s,a}) \cap A])$ for every forbidden
transition $(s, a)$ of $A$. A rough upper bound for the number of such transitions is
$|A| \cdot (|\Sigma| - 1)$. From these two upper bounds the proposition follows. □

Although this analysis may at first seem to condemn the transducer-constructing
algorithm, the reader should realize that, once computed, $T$ can be very efficiently
used to filter arbitrarily long strings. That is, unlike the stack-based algorithm,
its performance is linear in string length. Thus, one pays during the filter design
phase for an efficient run-time algorithm, a trade-off familiar, for example, in data
compression.
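The run-time side of this trade-off is easy to see in code. The sketch below applies an already constructed transducer to a string, doing one table lookup per letter. The table shown is a hand-written toy for the hypothetical domains $0^*$ and $1^*$; its two filled-in transitions and the symbols `t12` and `t21` are our own illustrative choices, not the output of Algorithm 2.

```python
def run_transducer(transitions, start, sigma):
    """Filter sigma with a complete transducer.

    transitions: dict mapping (state, letter) -> (output symbol, next state),
                 defined for every state-letter pair.
    Emits exactly one output symbol per input letter and stores only the
    current state, so the run time is linear in len(sigma).
    """
    state = start
    out = []
    for letter in sigma:
        symbol, state = transitions[(state, letter)]
        out.append(symbol)
    return out


# Toy transducer: labels 1 and 2 mark the domains 0* and 1*; "t12" and "t21"
# are hypothetical domain-to-domain transition symbols.
toy = {
    ("pq", "0"): (1, "p"), ("pq", "1"): (2, "q"),
    ("p", "0"): (1, "p"),  ("p", "1"): ("t12", "q"),
    ("q", "1"): (2, "q"),  ("q", "0"): ("t21", "p"),
}
labels = run_transducer(toy, "pq", "00110")
```

Here the boundaries between the runs of 0s and 1s are marked by the transition symbols, while the domain interiors carry only their labels.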

Constructing optimal transducers from non-minimal
domains, a preprocessing step to Algorithm 2

Recall that we constructed the transducer $\mathrm{Filter}(\{\mathcal{D}_i\})$ by 'filling in' the forbidden transitions of the automaton $\mathcal{A} := \mathrm{Det}(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$. This proved somewhat problematic, however, because $\mathcal{A}$'s states do not always preserve enough information about past input to unambiguously resynchronize to a unique, recurrent domain state. In order to help discriminate among the several possible resynchronization states, we introduced the partially ordered sets $\{S_{i,l}\}$. But even so, several attractive resynchronization states often fell into the same set $S_{i,l}$. So, lacking any objective way to choose among them, we resigned ourselves to a less attractive resynchronization state occurring in a later set $S_{i',l'}$, simply because it appeared alone there, making our choice unambiguous. If only the states of the automaton $\mathcal{A}$ preserved slightly more information about past input, then such compromises could be avoided.

In this section we present an algorithm that splits the states of a given collection $\{\mathcal{D}_i\}$ of domains to obtain an equivalent collection $\{\mathcal{D}'_i\} = \mathrm{Optimize}(\{\mathcal{D}_i\})$ of domains that preserve just enough information about past input to enable unambiguous resynchronization in the transducer obtained by filling in the forbidden transitions of the automaton $\mathcal{A}' := \mathrm{Det}(\mathcal{D}'_1 \sqcup \cdots \sqcup \mathcal{D}'_n)$.

We will accomplish this by associating to each state of the original domains $\mathcal{D}_i$ a collection of automata that partition past input strings into equivalence classes corresponding to individual resynchronization states. We will then refine these partitions so that $\mathcal{D}_i$'s transition structures can be lifted to them and thus obtain the desired domains $\{\mathcal{D}'_i\} = \mathrm{Optimize}(\{\mathcal{D}_i\})$.

This procedure, taken as a preprocessing step to Algorithm 2, will thus produce the best possible transducer for Method-2 multi-regular language filtering.

We now state our construction formally. If $s' \in S(\mathcal{A})$, then let $\mathcal{A}_{s'}$ denote the automaton that is identical to the automaton $\mathcal{A}$ except that its only final state is $s'$. Additionally, if $(s, a)$ is a forbidden transition of the automaton $\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n$, then let $\mathcal{B}(s, a, s')$ denote the automaton satisfying the formula:

$$\mathcal{B}(s, a, s') \cdot a = \mathrm{Det}(\mathcal{A}^{s,a}) \cap \mathcal{A}_{s'},$$

where $\cdot$ denotes concatenation. That is, let $\mathcal{B}(s, a, s')$ denote the automaton that is identical to the automaton $\mathrm{Det}(\mathcal{A}^{s,a}) \cap \mathcal{A}_{s'}$ except that its final states are given by $\{s_1 : (s_1, a, s_2) \in T(\mathcal{C}),\ s_2 \in \mathrm{Final}(\mathcal{C})\}$, where $\mathcal{C} := \mathrm{Det}(\mathcal{A}^{s,a}) \cap \mathcal{A}_{s'}$. Note that in most cases $\mathrm{Lang}(\mathcal{B}(s, a, s'))$ will be empty.

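Next we associate to each state $s \in S(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$ a collection $\Gamma(s)$ of automata. If the state $s$ has no forbidden transitions, let $\Gamma(s) := \{\Sigma^*\}$. If the state $s$ has at least one forbidden transition, however, then let $\Gamma(s)$ denote the collection of automata:

$$\Gamma(s) := \mathrm{Disjoin}(\{\Sigma^* \cdot \mathcal{B}(s, a, s') : (s, a, \cdot) \notin T(\mathcal{A}),\ s' \in S(\mathcal{A})\}),$$

where $\mathrm{Disjoin}(\{\mathcal{C}_\gamma\})$ denotes the coarsest partition of $\bigcup_\gamma \mathrm{Lang}(\mathcal{C}_\gamma)$ by automata $\{\mathcal{E}_\epsilon\}$ that is compatible with the automata $\{\mathcal{C}_\gamma\}$. That is, $\mathrm{Disjoin}(\{\mathcal{C}_\gamma\})$ denotes the smallest collection $\{\mathcal{E}_\epsilon\}$ of automata satisfying (i) $\bigcup_\epsilon \mathrm{Lang}(\mathcal{E}_\epsilon) = \bigcup_\gamma \mathrm{Lang}(\mathcal{C}_\gamma)$ and (ii) $\mathrm{Lang}(\mathcal{C}_\gamma) \cap \mathrm{Lang}(\mathcal{E}_\epsilon)$ is either empty or equal to $\mathrm{Lang}(\mathcal{E}_\epsilon)$ for all $\gamma$ and $\epsilon$.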

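It is possible to compute $\mathrm{Disjoin}(\{\mathcal{C}_\gamma\})$ inductively with the formula:

$$\mathrm{Disjoin}(\{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_n\}) = \{\mathcal{C}_1 \setminus (\mathcal{C}_2 \sqcup \cdots \sqcup \mathcal{C}_n)\} \cup \{\mathcal{C}_1 \cap \mathcal{C}' : \mathcal{C}' \in \mathrm{Disjoin}(\{\mathcal{C}_2, \ldots, \mathcal{C}_n\})\} \cup \{\mathcal{C}' \setminus \mathcal{C}_1 : \mathcal{C}' \in \mathrm{Disjoin}(\{\mathcal{C}_2, \ldots, \mathcal{C}_n\})\}.$$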
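The inductive computation of Disjoin can be illustrated on a small scale. The sketch below is an assumption-laden simplification: each automaton is replaced by a finite list of the strings it accepts (the type `Lang` and the function `disjoin` are our own names), so that intersection and difference become ordinary list operations. It keeps the part of the first language not covered by the others, the parts it shares with each block of the recursive result, and the parts of those blocks lying outside it, dropping empty blocks.

```haskell
import Data.List (intersect, nub, (\\))

-- A finite language stands in for Lang(C) of an automaton C (an assumption
-- made only so that the set operations are directly computable).
type Lang = [String]

-- Coarsest partition of the union of the given languages that is compatible
-- with each of them; empty blocks are dropped.
disjoin :: [Lang] -> [Lang]
disjoin [] = []
disjoin (c : cs) = filter (not . null) (onlyC1 : inBoth ++ onlyRest)
  where
    c1       = nub c                             -- remove repeated strings
    rest     = disjoin cs
    onlyC1   = c1 \\ concat cs                   -- C1 minus (C2 u ... u Cn)
    inBoth   = [c1 `intersect` c' | c' <- rest]  -- C1 n C'
    onlyRest = [c' \\ c1 | c' <- rest]           -- C' minus C1
```

For example, `disjoin [["a","b"],["b","c"]]` splits the two overlapping languages into the three blocks `["a"]`, `["b"]`, and `["c"]`.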

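Note that $\bigcup_{\mathcal{E} \in \Gamma(s)} \mathrm{Lang}(\mathcal{E}) = \Sigma^*$ for all states $s \in S(\mathcal{D}_1 \sqcup \cdots \sqcup \mathcal{D}_n)$. This is because $\mathrm{Lang}(\mathcal{B}(s, a, s'))$ contains only the empty string if $s' \in S(\mathcal{A})$ is the unique state reached on input $a$ from $\mathcal{A}$'s starting state, that is, if $(s_0, a, s') \in T(\mathcal{A})$, where $\{s_0\} = \mathrm{Start}(\mathcal{A})$.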

Our goal is to create for each original domain $\mathcal{D}_i$ an equivalent domain $\mathcal{D}'_i$ by splitting each state $s \in S(\mathcal{D}_i)$ into states of the form $(s, \mathcal{E})$, where $\mathcal{E} \in \Gamma(s)$. But to endow these split states with a transition structure equivalent to $\mathcal{D}_i$'s, we typically must refine the sets $\Gamma(s)$ further. We must construct a refinement $\Gamma'(s)$ of each $\Gamma(s)$ with the property that if $(s, a, s')$ is a transition of $\mathcal{D}_i$, then to each $\mathcal{E} \in \Gamma'(s)$ there corresponds a unique $\mathcal{E}' \in \Gamma'(s')$ with $\mathrm{Lang}(\mathcal{E} \cdot a) \subset \mathrm{Lang}(\mathcal{E}')$. Given such refinements $\Gamma'(s)$, we can take the pairs $\{(s, \mathcal{E}) : s \in S(\mathcal{D}_i),\ \mathcal{E} \in \Gamma'(s)\}$ as the states of $\mathcal{D}'_i$ and equip them with transitions of the form $(s, \mathcal{E}) \xrightarrow{a} (s', \mathcal{E}')$, and thus obtain an equivalent, but non-minimal, domain $\mathcal{D}'_i$.

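The following algorithm can be used to compute the desired refinements $\Gamma'(s)$.

Algorithm 3. Input: The domain $\mathcal{D}$ and the function $\Gamma$ that assigns to each state $s \in S(\mathcal{D})$ a collection $\Gamma(s)$ of automata that partition $\Sigma^*$.

- For each state $s \in S(\mathcal{D})$, let:

  $$\Gamma'(s) := \mathrm{Disjoin}\Big(\bigcup \{\Gamma'(s, a, s') : (s, a, s') \in T(\mathcal{D})\}\Big),$$

  where:

  $$\Gamma'(s, a, s') := \{\mathcal{E}_\alpha\}_\alpha \quad\text{and}\quad \{\mathcal{E}_\alpha \cdot a\}_\alpha := \{(\mathcal{E} \cdot a) \cap \mathcal{E}' : \mathcal{E} \in \Gamma(s),\ \mathcal{E}' \in \Gamma(s')\}.$$

- If $\Gamma'(s) \neq \Gamma(s)$ for some state $s \in S(\mathcal{D})$, then repeat with $\Gamma'$ in place of $\Gamma$. Otherwise:

Output: $\Gamma'$.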

Proposition 6. Algorithm 3 eventually terminates, producing the coarsest possible refinements $\Gamma'(s)$ of $\Gamma(s)$ compatible with $\mathcal{D}$'s transition structure.

Proof. We construct fine, but finite, refinements that are compatible with $\mathcal{D}$'s transition structure, then use this result to conclude that Algorithm 3 must eventually terminate. Moreover, we also conclude that, when Algorithm 3 terminates, it produces the coarsest possible refinements that are compatible with $\mathcal{D}$'s transition structure.

Let $\{\mathcal{E}_i\}$ denote the potentially large, but finite, collection of automata:

$$\{\mathcal{E}_i\}_{i=1}^{n} := \mathrm{Disjoin}\Big(\bigcup_{s \in S(\mathcal{D})} \Gamma(s)\Big),$$

which partition $\Sigma^*$.

We refine the partition $\{\mathcal{E}_i\}$ to make it compatible with $\mathcal{D}$'s transitions by examining the automaton $\mathcal{F} := \mathrm{Det}(\mathcal{E}_1 \sqcup \cdots \sqcup \mathcal{E}_n)$. Since the automata $\{\mathcal{E}_i\}$ cover $\Sigma^*$, the deterministic automaton $\mathcal{F}$ can have no forbidden transitions, and all its states must be final. Moreover, because the automata $\{\mathcal{E}_i\}$ are disjoint, each of $\mathcal{F}$'s states must correspond (under the canonical injection $\psi_{\mathcal{F}}$ of Lemma 2) to final states of precisely one automaton $\mathcal{E}_i$. In this way, the automata $\{\mathcal{E}_i\}$ correspond to a partition of the states of $\mathcal{F}$.

Since each automaton $\mathcal{E}_i$ is equivalent to the automaton obtained by restricting $\mathcal{F}$'s final states to those states corresponding (under $\psi_{\mathcal{F}}$) to final states of $\mathcal{E}_i$, we can refine the partition $\{\mathcal{E}_i\}$ by refining this partition of $\mathcal{F}$'s states.



Although a coarser refinement may suffice, we can always choose the partition consisting of single states. That is, if $s \in S(\mathcal{F})$, let $\mathcal{F}_s$ denote the automaton that is identical to the automaton $\mathcal{F}$ except that its only final state is $s$. Then $\{\mathcal{F}_s : s \in S(\mathcal{F})\}$ is a refinement of the partition $\{\mathcal{E}_i\}$ with the special property that for each automaton $\mathcal{F}_s$ and $a \in \Sigma$ there is a unique automaton $\mathcal{F}_{s'}$ such that $\mathrm{Lang}(\mathcal{F}_s \cdot a) \subset \mathrm{Lang}(\mathcal{F}_{s'})$. Indeed, since $\mathcal{F}$ is deterministic, $s'$ is the unique state corresponding to a transition $(s, a, s') \in T(\mathcal{F})$.

If we let $\Gamma''(s) := \{\mathcal{F}_{s'} : s' \in S(\mathcal{F})\}$ for each state $s \in S(\mathcal{D})$, then we obtain finite refinements of $\Gamma(s)$ compatible with $\mathcal{D}$'s transition structure, as desired.

This result implies that Algorithm 3 must eventually terminate. After all, every refinement that Algorithm 3 performs must already be reflected in $\Gamma''(s)$. Moreover, since every refinement that the algorithm performs is essential to compatibility with $\mathcal{D}$'s transition structure, the algorithm must, upon termination, produce the coarsest (smallest) compatible refinement possible. □



FIG. 7: The positive-entropy domains $\mathcal{D}_1$ and $\mathcal{D}_2$ of the binary, next-to-nearest neighbor CA 2614700074. (After Ref. [19].)



When applied to the domains $\mathcal{D}_1$ and $\mathcal{D}_2$ in Fig. 7, for example, Algorithm 3 produces the equivalent, non-minimal domains $\{\mathcal{D}'_1, \mathcal{D}'_2\} = \mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\})$ shown in Fig. 8. Notice these domains' many non-recurrent states. These have almost no effect on the automaton $\mathcal{A}' := \mathrm{Det}(\bigsqcup_i \mathcal{D}'_i)$.



IV. APPLICATIONS

We now present four applications to illustrate how the stack-based Algorithm 1 and its transducer approximation (Algorithms 2 and 3) solve the multi-regular language filtering problem. The first is the cellular automaton ECA 110, shown previously. Its rather large filtering transducer is quite tedious to construct by hand, but Algorithm 2 produces it handily. The second example, ECA 18, which we have also already seen, illustrates the stack-based Algorithm 1's ability to detect overlapping domains. The third example shows our methods' power to detect structures in the midst of apparent randomness: the domains and sharp boundaries between them are identified easily despite the fact that the domains themselves have positive entropy and their boundaries move stochastically. The example shows the use of, and need for, domain preprocessing (Algorithm 3). That is, rapid resynchronization is achieved using a filter built from optimized, non-minimal domains. The final example demonstrates the transducer (constructed by Algorithms 2 and 3) detecting domains in a multi-stationary process, what is called the change-point problem in statistical time-series analysis. This example emphasizes that the methods developed here are not limited to cellular automata. More importantly, it highlights several of the subtleties of multi-regular language filtering and clearly illustrates the need for the domain-preprocessing Algorithm 3.

FIG. 8: The equivalent, non-minimal domains $\{\mathcal{D}'_1, \mathcal{D}'_2\} = \mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\})$ obtained by applying Algorithm 3 to the positive-entropy domains $\mathcal{D}_1$ and $\mathcal{D}_2$ in Fig. 7 ($\mathcal{D}'_1$ (top) and $\mathcal{D}'_2$ (bottom)). The "Start" arrows are omitted for clarity (all states are starting), and some of the transitions are drawn with dashed arrows to help the reader distinguish the recurrent states.

FIG. 9: ECA 110's principal domain, sub[(00010011011111)*]. (Axis label: Site, 1 to 42.)



ECA 110

First consider ECA 110, illustrated earlier in Fig. 2. Its domains are easy to see visually; they have the form sub($w^*$) for some finite word $w$. Its dominant domain is sub($w^*$) = sub[(00010011011111)*], illustrated in Fig. 9. In fact, the transducer Filter({sub[(00010011011111)*]}), constructed from this single domain, filters ECA 110's space-time behavior well; see Fig. 10.
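Membership in a domain of the form sub($w^*$) is simple to test directly, which can be handy when checking a filter's output by hand. The sketch below uses our own hypothetical names (`inSubStar`, `eca110Word`) and is not part of the paper's implementation. It relies on the observation that any occurrence of $x$ in the infinite repetition of $w$ can be shifted to begin within the first copy of $w$.

```haskell
import Data.List (isInfixOf)

-- True when x occurs as a contiguous substring of some power of w,
-- that is, when x belongs to sub(w*). An occurrence can always be shifted
-- to start at an offset below |w|, so a prefix of the infinite repetition
-- of w of length |x| + |w| is enough to search.
inSubStar :: Eq a => [a] -> [a] -> Bool
inSubStar w x
  | null w    = null x
  | otherwise = x `isInfixOf` take (length x + length w) (cycle w)

-- The generating word of ECA 110's dominant domain, as given in the text.
eca110Word :: String
eca110Word = "00010011011111"
```

A string that straddles two copies of the word, such as `"1111100010"`, is accepted, whereas `"000000"` is rejected because the word never produces more than three consecutive 0s.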

Notice, in that figure, the wide variety of particle-like domain defects that the filtered version lays bare. Note, moreover, how these particles move and collide according to consistent rules. These particles are important to ECA 110's computational properties; a subset can be used to implement a Post Tag system [23] and thus simulate arbitrary Turing machines [24].



ECA 18 

Next, consider ECA 18, illustrated earlier in Fig. 2. It is somewhat more challenging to filter, because its domain $\mathcal{D}$ = sub([0(0 + 1)]*) has positive entropy. As a result, its particles are difficult, although by no means impossible, to see with the naked eye. Nevertheless, the stack-based algorithm filters its space-time diagrams extremely well, as illustrated in Fig. 2 (right). There, black rectangles are drawn where maximal substrings overlap, and vertical bars are drawn where maximal substrings abut. As mentioned earlier, these particles, whose precise location is somewhat ambiguous, follow random walks and pairwise annihilate whenever they touch [20, 21, 22].

FIG. 10: An ECA 110 space-time diagram (left) filtered by the transducer Filter({sub[(00010011011111)*]}) (right).

It is worth mentioning that the transducer $\mathrm{Filter}(\{\mathcal{D}\})$ produces a less precise filtrate in this case, and that $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}\}))$ does no better. Indeed, since breaks in ECA 18's domain have the form $\cdots 1(0^{2n})1 \cdots$, the precise location of the domain break is ambiguous: if reading left-to-right, it does not occur until the 1 on the right of $0^{2n}$ is read; whereas, if reading right-to-left, it does not occur until the 1 on the left is read. In other words, if reading left-to-right, the transducer $\mathrm{Filter}(\{\mathcal{D}\})$ detects only the right edges of the black triangles of Fig. 2 (right). Similarly, if reading right-to-left, it detects only the left edges of these triangles. In this case it is possible to fill in the space between these pairs of edges to obtain the output of the stack-based algorithm.
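The form of these breaks can be checked with a few lines of code. The following sketch, whose function names `inEca18Domain` and `breakWord` are our own, tests membership in sub([0(0 + 1)]*) using the fact that a binary string is a substring of a word in [0(0 + 1)]* exactly when every site of one parity is 0.

```haskell
-- Membership in sub([0(0+1)]*): a binary string is a domain word exactly
-- when all of its even-indexed sites, or all of its odd-indexed sites, are 0.
inEca18Domain :: String -> Bool
inEca18Domain w = allZero evens || allZero odds
  where
    indexed    = zip [0 :: Int ..] w
    evens      = [c | (k, c) <- indexed, even k]
    odds       = [c | (k, c) <- indexed, odd k]
    allZero cs = all (== '0') cs

-- The pattern 1 0^(2n) 1 that the text identifies as a break in the domain.
breakWord :: Int -> String
breakWord n = "1" ++ replicate (2 * n) '0' ++ "1"
```

Every `breakWord n` places one 1 on an even site and the other on an odd site, so it is rejected, while a pair of 1s separated by an odd number of 0s, such as `"10001"`, is a domain word.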



CA 2614700074

Now consider the binary, next-to-nearest neighbor (i.e. k = r = 2) CA 2614700074, shown in Fig. 11. Crutchfield and Hanson constructed it expressly to have the positive-entropy domains $\mathcal{D}_1$ and $\mathcal{D}_2$ in Fig. 7 [19].

As illustrated in Fig. 11, the optimal transducer $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$ filters this CA's output well. This illustrates a practical advantage of multi-regular language filtering: it can detect structure embedded in randomness. Notice how the filter easily identifies the domains and sharp boundaries separating them, even though the domains themselves have positive entropy and their boundaries move stochastically.

It is worth noting that in place of the gray regions of Fig. 11, so clearly identified by the optimal transducer as corresponding to the second domain $\mathcal{D}_2$, the simpler transducer $\mathrm{Filter}(\{\mathcal{D}_1, \mathcal{D}_2\})$ produces a regular checkering of false domain breaks (not pictured). This is because, when examining the sole forbidden transition $(s, a) = (2, 1)$ of the first domain $\mathcal{D}_1$, Algorithm 2 discovers that the first non-empty set $S_{i=1,l=4} = \{2, 4, 5\}$ contains three resynchronization states. It unfortunately abandons both states 4 and 5, which belong to the second domain, instead choosing to resynchronize to the original state 2 itself, because it occurs alone in the next set $S_{1,5}$. As a result, the transducer $\mathrm{Filter}(\{\mathcal{D}_1, \mathcal{D}_2\})$ has no transitions leaving the first domain whatsoever and is therefore incapable of detecting jumps from the first domain to the second. This is why it prints a checkering of domain breaks instead of correctly resynchronizing to the second domain. The optimal transducer does not suffer from this problem, because Algorithm 3 splits state 2 into several new ones, from which unambiguous resynchronization to the appropriate state (2, 4, or 5) is possible.

FIG. 11: Binary, next-to-nearest neighbor CA 2614700074 space-time diagram (left) filtered by the transducer $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$ (right). The white regions on the right correspond to the domain $\mathcal{D}_1$, the gray to the domain $\mathcal{D}_2$. The black squares separating these regions correspond to the interruption symbols $h'(s, s')$ that the transducer emits between domains.

Change-Point Problem: Filtering Multi-Stationary 
Sources 

Leaving cellular automata behind, consider a binary information source that hops with low probability between the two three-state domains $\mathcal{D}_1$ and $\mathcal{D}_2$ in Fig. 12 (top). This source allows us to illustrate subtleties in multi-regular language filtering and, in particular, in the construction of the optimal transducer $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_i\}))$.

To appreciate how subtle filtering with the domains $\mathcal{D}_1$ and $\mathcal{D}_2$ is, and why the extra states of $\mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\})$ are needed to do it, consider the following. First choose any finite word $w$ of the form:

As the ambitious reader can verify, both of the strings $101111w$ and $110w$ belong to the domain $\mathcal{D}_2$. In fact, both correspond to unique paths through $\mathcal{D}_1 \sqcup \mathcal{D}_2$ ending in state 5 of Fig. 12 (top).



On the other hand, the strings $01111w1$ and $10w1$ are also domain words, the first belonging to $\mathcal{D}_2$, but the second belonging to $\mathcal{D}_1$. In fact, $01111w1$ corresponds to a unique path through $\mathcal{D}_1 \sqcup \mathcal{D}_2$ ending in state 6, while $10w1$ corresponds to a unique path ending in state 3.

As a result, these four strings are the maximal substrings of the non-domain strings $101111w1$ and $110w1$, as indicated by the brackets below:

    [1 0 1 1 1 1 w] 1    corresponds to a unique path through $\mathcal{D}_2$ ending in state 5
    1 [0 1 1 1 1 w 1]    corresponds to a unique path through $\mathcal{D}_2$ ending in state 6

    [1 1 0 w] 1          corresponds to a unique path through $\mathcal{D}_2$ ending in state 5
    1 [1 0 w 1]          corresponds to a unique path through $\mathcal{D}_1$ ending in state 3

This example illustrates several important points. First of all, it shows that when the naive transducer $\mathrm{Filter}(\{\mathcal{D}_1, \mathcal{D}_2\})$ reaches the forbidden letter 1 at the end of either of these two strings, the state 2 reached does not preserve enough information to resynchronize to the appropriate state (3 or 6, respectively). As a result, it must either make a guess, at the risk of choosing incorrectly and then later reporting an artificial domain break (as in the preceding cellular automaton example), or else jump to one of its non-recurrent states, emitting a potentially long chain of $\lambda$s until it can re-infer from future input what was already determined by past input.

FIG. 12: Two similar three-state domains $\mathcal{D}_1$ (top left) and $\mathcal{D}_2$ (top right) illustrate how subtle the construction of the optimal transducer $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_i\}))$ can be: the automaton $\mathcal{A}' := \mathrm{Det}(\bigsqcup \mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$ (below), from which the optimal transducer is constructed, has 69 states; the unoptimized automaton $\mathcal{A} := \mathrm{Det}(\mathcal{D}_1 \sqcup \mathcal{D}_2)$ (not pictured) has 30.

As unsettling as this may be, the example illustrates something far more nefarious. Since an arbitrarily long word $w$ can be chosen, it is impossible to fix the problem by splitting the states of $\mathrm{Filter}(\{\mathcal{D}_1, \mathcal{D}_2\})$ so as to buffer finite windows of past input. In fact, because $w$ is chosen from a language with positive entropy, the number of windows that would need to be buffered grows exponentially.

At this point achieving optimal resynchronization might seem hopeless, but it actually is possible. This is what makes Algorithm 3, and in particular the proof that it terminates (Prop. 6), not only surprising, but extremely useful.

Indeed, recall that instead of splitting states according to finite windows, Algorithm 3 splits them according to entire regular languages of past input and that, by Prop. 6, a finite number of these regular languages will always suffice to achieve optimal resynchronization. And so, instead of reaching the same original state 2 when reading the strings $101111w$ and $110w$, the optimal transducer $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$ reaches two distinct states $(2, \mathcal{E})$ and $(2, \mathcal{E}')$, where $101111w \in \mathrm{Lang}(\mathcal{E})$ and $110w \in \mathrm{Lang}(\mathcal{E}')$. These two split states are labeled with the enlarged integers 15 and 13, respectively, in Fig. 12 (bottom), which shows $\mathcal{A}' := \mathrm{Det}(\bigsqcup \mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$, the automaton from which $\mathrm{Filter}(\mathrm{Optimize}(\{\mathcal{D}_1, \mathcal{D}_2\}))$ is constructed. As illustrated in that figure, the optimal transducer has 69 states; the unoptimized automaton $\mathcal{A} := \mathrm{Det}(\mathcal{D}_1 \sqcup \mathcal{D}_2)$ (not pictured) has 30.



V. CONCLUSION 

We posed the multi-regular language filtering prob- 
lem and presented two methods for solving it. The 
first, although providing the ideal solution, requires 
a stack, has a worst-case compute time that grows 
quadratically in string length and conditions its out- 
put at any point on arbitrarily long windows of fu- 
ture input. The second method was to algorithmi- 
cally construct a transducer that approximates the 
first algorithm. In contrast to the stack-based algo- 
rithm it approximates, however, the transducer re- 
quires only a finite amount of memory, runs in lin- 
ear time, and gives immediate output for each let- 
ter read — significant improvements for cellular au- 
tomata structural analysis and, we suspect, for other 
applications as well. It is, moreover, the best possible approximation with these three features. Finally, we applied both methods to the computational-mechanics structural analysis of cellular automata and to a version of the change-point problem from time-series analysis.

Future directions for this work include generaliza- 
tion both to probabilistic patterns and transducers 
and to higher dimensions. Although both seem diffi- 
cult, the latter seems most daunting — at least from 
the standpoint of transducer construction — because 
there is as yet no consensus on how to approach the 
subtleties of high-dimensional automata theory. (See, 
for example, Refs. [25] and [26] for discussions of two-dimensional generalizations of regular languages and
patterns.) Note, however, that the basic notion of 
maximal substrings underlying the stack-based al- 
gorithm is easily generalized to a broader notion of 
higher-dimensional maximal connected subregions, 
although we suspect that this generalization will be 
much more difficult to compute. 

In the introduction we alluded to a range of addi- 
tional applications of multi-regular language filter- 
ing. Segmenting time series into structural compo- 
nents was illustrated by the change-point example. 
This type of time series problem occurs in many ar- 
eas, however, such as in speech processing where the 
structural components are hidden Markov models of 
phonemes, for example, and in image segmentation 
where the structural components are objects or even 
textures. One of the more promising areas, though, 
is genomics. In genomics there is often quite a bit 
of prior biochemical knowledge about structural re- 
gions in biosequences. Finally, when coupled with 
statistical inference of stationary domains, so that 
the structural components are estimated from a data 
stream, multi-regular language filtering should provide a powerful and broadly applicable pattern detection tool.

APPENDIX A: AUTOMATA THEORY PRELIMINARIES

In this appendix we review the definitions and results from automata theory that are essential to our exposition. A good source for these preliminaries is Ref. [27], although its authors employ altogether different notation, which does not suit our needs.



Automata 

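An automaton $\mathcal{A}$ over an alphabet $\Sigma(\mathcal{A})$ is a collection of states $S(\mathcal{A})$, together with subsets $\mathrm{Start}(\mathcal{A}), \mathrm{Final}(\mathcal{A}) \subset S(\mathcal{A})$, and a collection of transitions $T(\mathcal{A}) \subset S(\mathcal{A}) \times \Sigma(\mathcal{A}) \times S(\mathcal{A})$. We call an automaton finite if both $S(\mathcal{A})$ and $T(\mathcal{A})$ are.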

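An automaton $\mathcal{A}$ accepts a string $\sigma = a_1 a_2 \cdots a_n$ if there is a sequence of transitions $(s_1, a_1, s_2), (s_2, a_2, s_3), \ldots, (s_n, a_n, s_{n+1}) \in T(\mathcal{A})$ such that $s_1 \in \mathrm{Start}(\mathcal{A})$ and $s_{n+1} \in \mathrm{Final}(\mathcal{A})$. Denote the collection of all strings that $\mathcal{A}$ accepts by $\mathrm{Lang}(\mathcal{A})$. Two automata $\mathcal{A}$ and $\mathcal{B}$ are said to be equivalent if $\mathrm{Lang}(\mathcal{A}) = \mathrm{Lang}(\mathcal{B})$.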

We can think of an automaton as a directed graph whose edges are labeled with symbols from $\Sigma(\mathcal{A})$. In this view, an automaton accepts precisely those strings that correspond to paths through its graph beginning in its start states and ending in its final ones.
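The graph view of acceptance translates directly into code. The sketch below reuses the record layout of the `FA` type from Appendix B, but the function `accepts` and the example `altDomain` are our own illustrative additions. It follows all paths labeled by the input simultaneously by tracking the set of states reachable so far.

```haskell
import Data.List (nub)

-- Start states, labeled transitions (source, symbol, target), and final states.
data FA s i = FA { faStarts :: [s], faTrans :: [(s, i, s)], faFinals :: [s] }

-- Follow every path labeled by the input at once, tracking the set of states
-- reachable so far; accept if a final state is reachable at the end.
accepts :: (Eq s, Eq i) => FA s i -> [i] -> Bool
accepts fa = any (`elem` faFinals fa) . foldl step (faStarts fa)
  where
    step qs a = nub [s' | (s, b, s') <- faTrans fa, b == a, s `elem` qs]

-- Hypothetical example: a two-state domain for sub((01)*), in which every
-- state is both a start state and a final state.
altDomain :: FA Int Char
altDomain = FA [0, 1] [(0, '0', 1), (1, '1', 0)] [0, 1]
```

Here `accepts altDomain "1010"` holds, while `accepts altDomain "0110"` fails because no path survives the repeated 1.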

An automaton $\mathcal{A}$ is said to be semi-deterministic if any pair of its transitions that agree in the first two slots are identical, that is, any pair of transitions of the form $(s_1, a, s_2)$ and $(s_1, a, s_2') \in T(\mathcal{A})$ satisfy $s_2 = s_2'$. A deterministic automaton is one that is semi-deterministic and that has a single start state. If $\mathcal{A}$ is deterministic, then each string of $\mathrm{Lang}(\mathcal{A})$ corresponds to precisely one path through $\mathcal{A}$'s graph.

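For two automata $\mathcal{A}$ and $\mathcal{B}$, let $\mathcal{A} \sqcup \mathcal{B}$ denote their disjoint union: the automaton over the alphabet $\Sigma(\mathcal{A}) \cup \Sigma(\mathcal{B})$ whose states are the disjoint union of the states of $\mathcal{A}$ and $\mathcal{B}$, i.e. $S(\mathcal{A} \sqcup \mathcal{B}) := S(\mathcal{A}) \sqcup S(\mathcal{B})$ (and similarly for its start and final states) and whose transitions are the union of the transitions of $\mathcal{A}$ and $\mathcal{B}$. In this way $\mathrm{Lang}(\mathcal{A} \sqcup \mathcal{B}) = \mathrm{Lang}(\mathcal{A}) \cup \mathrm{Lang}(\mathcal{B})$.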

In this terminology, a domain is a semi-deterministic finite automaton $\mathcal{D}$ whose states are all start and final states, i.e. $\mathrm{Start}(\mathcal{D}) = S(\mathcal{D}) = \mathrm{Final}(\mathcal{D})$, and whose graph is strongly connected, i.e., there is a path from any one state to any other.

Finally, a domain $\mathcal{D}$ is said to be minimal if all equivalent domains $\mathcal{D}'$ satisfy $|S(\mathcal{D})| \le |S(\mathcal{D}')|$.



Standard Results 

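Lemma 2. Every automaton $\mathcal{A}$ is equivalent to a deterministic automaton $\mathrm{Det}(\mathcal{A})$. Moreover, $\mathrm{Det}(\mathcal{A})$'s states correspond uniquely to collections of $\mathcal{A}$'s states; in other words, there is a canonical injection

$$\psi_{\mathcal{A}} : S(\mathrm{Det}(\mathcal{A})) \hookrightarrow \{S : S \subset S(\mathcal{A})\}.$$

Lemma 3. If $\mathcal{A}$ and $\mathcal{B}$ are automata, then there is an automaton $\mathcal{A} \cap \mathcal{B}$ that accepts precisely those strings accepted by both $\mathcal{A}$ and $\mathcal{B}$; that is, $\mathrm{Lang}(\mathcal{A} \cap \mathcal{B}) = \mathrm{Lang}(\mathcal{A}) \cap \mathrm{Lang}(\mathcal{B})$. If $\mathcal{A}$ and $\mathcal{B}$ are deterministic, then so is $\mathcal{A} \cap \mathcal{B}$. Moreover, there is a canonical injection $S(\mathcal{A} \cap \mathcal{B}) \hookrightarrow S(\mathcal{A}) \times S(\mathcal{B})$, which restricts to injections $\mathrm{Start}(\mathcal{A} \cap \mathcal{B}) \hookrightarrow \mathrm{Start}(\mathcal{A}) \times \mathrm{Start}(\mathcal{B})$ and $\mathrm{Final}(\mathcal{A} \cap \mathcal{B}) \hookrightarrow \mathrm{Final}(\mathcal{A}) \times \mathrm{Final}(\mathcal{B})$.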



Transducers 

A transducer $T$ from an alphabet $\Sigma(T)$ to an alphabet $\Sigma'(T)$ is an automaton on the alphabet $\Sigma(T) \times \Sigma'(T)$. We will use the more traditional notation $(s, b|c, s')$ in place of $(s, (b, c), s') \in T(T)$.

The input of a transducer $T$ is the automaton $\mathrm{In}(T)$ whose states, start states, and final states are the same as $T$'s, but whose transitions are given by $T(\mathrm{In}(T)) := \{(s, b, s') : (s, b|c, s') \in T(T)\}$. Similarly, the output of a transducer $T$ is the automaton $\mathrm{Out}(T)$ whose transitions are given by $T(\mathrm{Out}(T)) := \{(s, c, s') : (s, b|c, s') \in T(T)\}$.

A transducer $T$ is said to be well defined if $\mathrm{In}(T)$ is deterministic, because such a transducer determines a function from $\mathrm{Lang}(\mathrm{In}(T))$ onto $\mathrm{Lang}(\mathrm{Out}(T))$.
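These definitions amount to simple projections on the transition list. The sketch below reuses the record layout of the `FA` and `Transducer` types from Appendix B; the helper names `trIn`, `trOut`, and `wellDefined` are our own assumptions and are not part of the paper's code.

```haskell
-- Same record layout as in Appendix B; a transducer is an automaton over
-- pairs (input symbol, output symbol).
data FA s i = FA { faStarts :: [s], faTrans :: [(s, i, s)], faFinals :: [s] }
type Transducer s i o = FA s (i, o)

-- In(T): keep only the input symbol of each transition.
trIn :: Transducer s i o -> FA s i
trIn t = FA (faStarts t) [(s, b, s') | (s, (b, _), s') <- faTrans t] (faFinals t)

-- Out(T): keep only the output symbol of each transition.
trOut :: Transducer s i o -> FA s o
trOut t = FA (faStarts t) [(s, c, s') | (s, (_, c), s') <- faTrans t] (faFinals t)

-- T is well defined when In(T) is deterministic: a single start state, and
-- any two transitions agreeing on (state, input symbol) share their target.
wellDefined :: (Eq s, Eq i) => Transducer s i o -> Bool
wellDefined t =
  length (faStarts t) == 1
    && and [s1 == s2 | (p, b, s1) <- tr, (q, c, s2) <- tr, p == q, b == c]
  where
    tr = faTrans (trIn t)
```

A transducer with two differently targeted transitions on the same state and input letter fails the `wellDefined` check, since its output would then depend on which path is taken.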



APPENDIX B: IMPLEMENTATION 

In order to give the reader a sense for how the al- 
gorithms can be implemented, we rigorously imple- 
ment Algorithm 121 here in the programming language 
Haskell 1 28]. Haskell represents the state of the art 
in polymorphicly typed, lazy, purely functional pro- 
gramming language design. Its concise syntax en- 
ables us to implement the algorithm in less than a 
page. Haskell compilers and interpreters are freely 
available for almost any computer |30]. 

We emulate our exposition in the preceding sections by representing a finite automaton as a list of starting states, a list of transitions, and a list of final states, and a transducer as a finite automaton whose alphabet consists of pairs of symbols:

    import Data.List (nub, union, intersect, (\\))

    data FA s i = FA {faStarts :: [s],
                      faTrans  :: [(s, i, s)],
                      faFinals :: [s]}

    type Transducer s i o = FA s (i, o)

We need the following simple functions, which compute the list of symbols and states present in an automaton:

    faAlphabet :: Eq i => FA s i -> [i]
    faAlphabet fa = nub [a | (_, a, _) <- faTrans fa]

    transStates :: Eq s => [(s, i, s)] -> [s]
    transStates trans =
      nub $ foldl (\ss (s, _, s') -> s : s' : ss) [] trans

    faStates :: Eq s => FA s i -> [s]
    faStates fa = foldl union [] [faStarts fa,
                                  transStates $ faTrans fa,
                                  faFinals fa]

We also require the following three functions: the first two implement Lemmas 2 and 3, and the third computes the disjoint union of a list of automata. To expedite our exposition, we provide only their type signatures:

    faDet :: (Ord s, Eq i) => FA s i -> FA [s] i
    faIntersect :: (Eq s, Eq s', Eq i)
                => FA s i -> FA s' i -> FA (s, s') i
    faDisjointUnion :: [FA s i] -> FA (Int, s) i
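For readers who wish to experiment, one possible way to fill in these three signatures is sketched below. This is our own minimal reconstruction under stated assumptions, not the implementation used to generate the examples: `faDet` is a plain subset construction that represents each state as a sorted, duplicate-free list, `faIntersect` is the usual product construction (unreachable pairs are not pruned), and `faDisjointUnion` tags each state with the 1-based index of its automaton.

```haskell
import Data.List (nub, sort)

data FA s i = FA { faStarts :: [s], faTrans :: [(s, i, s)], faFinals :: [s] }

-- Subset construction (Lemma 2). Each state of the result is a canonical
-- (sorted, duplicate-free) list of states of the argument.
faDet :: (Ord s, Eq i) => FA s i -> FA [s] i
faDet fa = FA [start] trans finals
  where
    canon q      = nub (sort q)
    start        = canon (faStarts fa)
    alphabet     = nub [a | (_, a, _) <- faTrans fa]
    step q a     = canon [s' | (s, b, s') <- faTrans fa, b == a, s `elem` q]
    successors q = [q' | a <- alphabet, let q' = step q a, not (null q')]
    -- Breadth-first exploration of the reachable subsets.
    explore seen [] = reverse seen
    explore seen (q : todo)
      | q `elem` seen = explore seen todo
      | otherwise     = explore (q : seen) (todo ++ successors q)
    states = explore [] [start]
    trans  = [(q, a, q') | q <- states, a <- alphabet,
                           let q' = step q a, not (null q')]
    finals = [q | q <- states, any (`elem` faFinals fa) q]

-- Product construction (Lemma 3). States are pairs of argument states.
faIntersect :: (Eq s, Eq s', Eq i) => FA s i -> FA s' i -> FA (s, s') i
faIntersect x y = FA starts trans finals
  where
    starts = [(s, t) | s <- faStarts x, t <- faStarts y]
    trans  = [((s, t), a, (s', t')) | (s, a, s') <- faTrans x,
                                      (t, b, t') <- faTrans y, a == b]
    finals = [(s, t) | s <- faFinals x, t <- faFinals y]

-- Disjoint union: tag each state with the index of its automaton.
faDisjointUnion :: [FA s i] -> FA (Int, s) i
faDisjointUnion fas = FA starts trans finals
  where
    tagged = zip [1 ..] fas
    starts = [(k, s) | (k, fa) <- tagged, s <- faStarts fa]
    trans  = [((k, s), a, (k, s')) | (k, fa) <- tagged, (s, a, s') <- faTrans fa]
    finals = [(k, s) | (k, fa) <- tagged, s <- faFinals fa]
```

The list-based set operations keep the sketch short at the cost of efficiency; a serious implementation would likely use a dedicated set or map data structure.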



Notice that the first two functions return automata whose states are represented as lists and pairs of the argument automata's states; these representations intrinsically encode the Lemmas' canonical injections. The function faDisjointUnion returns an automaton whose states are represented as pairs $(i, s)$ where $s$ is a state of the $i$th argument automaton $\mathcal{D}_i$.

We use these representations frequently in the following implementation of Algorithm 2.



    transducerFilterFromDomains :: (Eq s, Ord s, Eq i) => [FA s i] -> Transducer [(Int, s)] i Int
    transducerFilterFromDomains faDs =
        FA (faStarts faA) (baseTTrans ++ newTTrans) (faFinals faA)
      where faA = faDet $ faDisjointUnion faDs                -- A := Det(D_1 U ... U D_n)
            baseTTrans = [(s, (a, f s'), s') | (s, a, s') <- faTrans faA]
              where f ss | length is == 1 = head is           -- transition ending in D_i
                         | otherwise      = 0                 -- synchronization, lambda
                      where is = nub $ map fst ss
            forbiddenPairs = [(s, a) | s <- faStates faA, a <- faAlphabet faA]
                             \\ [(s, a) | (s, a, _) <- faTrans faA]
            newTTrans = map newTransition forbiddenPairs
            newTransition (s, a) = (s, (a, o), s')
              where faAsa = (FA (f : faStates faA) ((s, a, f) : faTrans faA) [f])  -- the automaton A^{s,a}
                    f = [(1 + length faDs, head $ faStarts $ head faDs)]           -- the fresh state f used to build A^{s,a}
                    faDetAsaCapA = (faDet faAsa) `faIntersect` faA                 -- the automaton Det(A^{s,a}) n A
                    faZ = faDet $ zero faDetAsaCapA                                -- the automaton Det(Z[Det(A^{s,a}) n A])
                    reachableStateSeq = take (length $ faStates faZ)               -- the states s_1, ..., s_{m+m'}
                                             (iterate nextState $ head $ faStarts faZ)
                      where nextState s = head [s'' | (s', _, s'') <- faTrans faZ, s' == s]
                    sStarLs = map ((map snd) .                                     -- the sets {S_{*,l}}
                                   (intersect $ faFinals faDetAsaCapA)) reachableStateSeq
                    siStars = [nub                                                 -- the sets {S_{i,*}}
                               [s | s <- map snd $ faFinals faDetAsaCapA, length s == i]
                              | i <- [1 .. length $ head $ faStarts faA]]
                    sijs = concatMap (\siStar -> map (intersect siStar) sStarLs) siStars  -- the sets {S_{i,l}}
                    [s'] = head $ filter (\sij -> length sij == 1) sijs            -- the state s' to which to synchronize
                    o = -1                                                         -- domain break
            zero fa = FA (faStarts fa) [(s, 0, s') | (s, _, s') <- faTrans fa] (faFinals fa)



To implement the domain optimization algorithm Optimize(·) is somewhat more, but not overwhelmingly, complicated. In fact, we generated the examples here by computer, rather than by hand.